NLP4NLP: Applying NLP to Scientific Corpora about Written and Spoken Language Processing

Authors

  • Gil Francopoulo
  • Joseph-Jean Mariani
  • Patrick Paroubek

Abstract

We address the problem of analyzing how the trends of a scientific domain evolve, in order to provide insight into its current state and to establish reliable hypotheses about its future. We approach the problem by processing both the metadata and the text content of the domain's publications. Ideally, one would like to automatically synthesize all the information present in the documents and their metadata. As members of the NLP community, we have applied the tools developed by our community to publications from our own domain, in what could be termed a “recursive” approach. As a first step, we assembled a corpus of papers from NLP conferences and journals for both text and speech, covering documents produced from the 1960s up to 2015. We then mined this publication database to draw a picture of our field from quantitative and qualitative results, according to a wide range of perspectives: sub-domains, specific communities, chronology, terminology, conceptual evolution, re-use and plagiarism, trend prediction, novelty detection, and many more. We provide here an account of the corpus collection and of its processing with NLP technology, indicating for each aspect which technology was used. We conclude with the benefits such a corpus brings to the actors of the domain and the conditions for generalizing this approach to other scientific domains.

Conference Topics: Methods and techniques, Citation and co-citation analysis, Scientific fraud and dishonesty, Natural Language Processing

Similar resources

Robust Extraction of Subcategorization Data from Spoken Language

Subcategorization data has been crucial for various NLP tasks. Current methods for automatic SCF acquisition usually proceed in two steps: first, generate all SCF cues from a corpus using a parser, and then filter out spurious SCF cues with statistical tests. Previous studies on SCF acquisition have worked mainly with written texts; spoken corpora have received little attention. Transcripts of ...

Concordancing for parallel spoken language corpora

Concordancing is one of the oldest corpus analysis tools, especially for written corpora. In NLP, concordancing appears in the training of speech-recognition systems. Additionally, comparative studies of different languages result in parallel corpora. Concordancing for these corpora in an NLP context is a new approach. We propose to combine these fields of interest for a multi-purpose concordance for ...

Parsing Arabic Dialects

The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem ...

Cross-Domain and Cross-Language Porting of Shallow Parsing

English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issu...

Instant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora

The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which use similar data and technical frameworks and are carried out in Freiburg and in collaboration with Hamburg, Syktyvkar, Tromsø and Uppsala. Our projects work in the endangered language documentation framework and record new spoken language data, digitize available recor...

Publication date: 2015